This document covers the cleaning of the data scraped from Yelp.
The general layout of the data is as follows:
yelp <- read.csv('../data/yelpData.csv', stringsAsFactors = FALSE)
colnames(yelp) <- tolower(colnames(yelp))
## drop breweries that couldn't be scraped
yelp <- yelp[!is.na(yelp$ratings),]
yelp
yelp %>%
group_by(name) %>%
summarize(n()) %>%
nrow()
## [1] 2465
Of the nearly 8,000 breweries on Open Brewery DB, I was able to successfully scrape 2,465 on the first pass.
Off the bat, I was a little worried because rating appears null across the board, but I see that it’s redundant with ratings, which is present. Some notes about specific vars:
moreInfoVar The trickiest part will be handling moreInfoVar, as this captures idiosyncratic additional business detail, the features of which vary across businesses. It’s represented in long form for now, but I’ll want to experiment with getting it into wide format so each brewery can be represented as a single observation.
Hours I’ll probably want to parse the hours into the total number of hours open each day. But this doesn’t capture whether a brewery is open earlier / later; maybe try to code for that with a different variable. Watch out for the "Open now" text that’s stored in the Mon hours. This will show up in the Tue hours in the data I’m collecting today.
claimed_status Not sure what this is; it seems to be constant across breweries, so I might as well drop it.
health_rating I really like the idea of this one, but based on the first 400 breweries, this var is very sparse and heterogeneous.
price_range This one is important. Looks like it’s represented with either an adjective (e.g., moderate) or price range (e.g., $11-30).
ratings reviews Obviously the dependent vars. I’ll have to think about exactly how to handle the fact that there are two of them.
ratings_histogram I like the idea of capturing this variability. But this is information relating to the DV, so including it as a feature doesn’t make sense. I’ll have to think about that.
I’ll dig into formatting the hours first.
One straightforward task to start with is converting the hours for each day into new vars that just hold the total number of hours open that day. Later on I’ll want to consider how to capture how early / late those hours are.
print(typeof(yelp$mon))
## [1] "character"
print(class(yelp$mon))
## [1] "character"
s1 <- yelp$mon[1]
s2 <- yelp$tue[1]
print(c(s1,s2))
## [1] "12:00 pm - 10:00 pm\n \n Open now"
## [2] "12:00 pm - 10:00 pm"
Alright so, for all of these, I’ll want to keep only the data before the first \n, then I’ll want to split the range into two vars for the beginning and the end.
Then convert to time, compute the total hours that a brewery is open, reshape it all so that all hours-of-operation information for each brewery is on one row, and join it back with the main data.
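Before running that over the whole data set, here's a minimal base-R sketch of the sequence on the sample string from above (the variable names `parts`, `opens`, and `closes` are just throwaways for the demo):

```r
s <- '12:00 pm - 10:00 pm\n \n Open now'
## keep only the text before the first newline
s <- sub('\n.*', '', s)
## split the range into open and close times
parts <- strsplit(s, ' - ')[[1]]
opens  <- strptime(parts[1], format = '%I:%M %p')
closes <- strptime(parts[2], format = '%I:%M %p')
## total hours open that day
as.numeric(difftime(closes, opens, units = 'hours'))
## [1] 10
```

The real pipeline below does the same thing with regex extraction so it survives messier strings.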
## Regex to extract time
simpleTime <- '\\d+:\\d\\d\\s\\w\\w'
## keep one row per brewery
test <- yelp[match(unique(yelp$name), yelp$name),]
## Regex and time conversion to get time data to behave like time data
test <- test %>%
select(name, mon, tue, wed, thu, fri, sat, sun) %>%
gather(day, times, mon:sun) %>%
## Separate into open and close times based on the '-'
separate(times, into=c('open', 'close'), sep='-') %>%
## match only strings matching time format
mutate(open = str_extract(open, simpleTime),
close = str_extract(close, simpleTime))
## Warning: Expected 2 pieces. Additional pieces discarded in 31 rows [848,
## 1644, 2133, 2661, 3313, 4109, 4598, 5126, 5156, 5778, 6574, 7063, 7591,
## 7621, 8243, 9039, 9528, 10056, 10708, 11238, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3075 rows
## [4, 10, 15, 17, 18, 19, 21, 24, 26, 27, 28, 34, 36, 38, 41, 42, 44, 52, 63,
## 69, ...].
## Convert to time
test$open <- strptime(test$open, format = '%I:%M %p')
test$close <- strptime(test$close, format = '%I:%M %p')
## Strip out the date, keep only the time
test$open <- times(sub('.*\\s+', '', test$open))
test$close <- times(sub('.*\\s+', '', test$close))
## Doing some serious data reshaping to get time information on one line per brewery
## Goal is to spread time open, time closed, and total time to wide vars of the form: day_open/close_TotalTime/Not
yelp <- test %>%
## Compute total hours open
mutate(totalHours = abs(close - open)) %>%
## Gather open / close times
gather(timePeriod, time, open:close) %>%
## Gather total hours and open / close times
gather(TotalOrTime, value, totalHours, time) %>%
## Combine day, open / close times, total time
unite(variable, day, timePeriod, TotalOrTime) %>%
## Spread across the united variable
spread(variable, value) %>%
inner_join(yelp)
## Joining, by = "name"
## open and close for total hours are the same thing
sum(yelp$thu_open_totalHours == yelp$thu_close_totalHours, na.rm = TRUE) == nrow(yelp[!is.na(yelp$thu_open_totalHours),])
## [1] TRUE
yelp <- yelp %>%
select(-contains('_close_totalhours'))
yelp
Whew, that was tough.
I’m going to get together a function for sorting the various levels of price_range into bins to be used as an ordinal factor.
print(summary(factor(yelp$price_range)))
##                $11-30    $31-60   A$16-35 Inexpensive
##   722           13837       159       100        9189
##      Moderate   Pricey  SGD16-30 Ultra High-End Under $10
##          8075       65         5            14        765
I think the general idea will be to make three levels: ['cheap', 'moderate', 'expensive'] where the cutoffs are as follows:
< $10 = cheap
>= $10 & < $30 = moderate
> $30 = expensive
cleanPriceRange <- function(price_range){
  ### takes as input char vector price_range
  ### returns price range (still as a char var) sorted into three groups
  ## buckets are kept in lowercase; otherwise 'A$16-35' and 'Moderate'
  ## would never match after the tolower() below
  cheapBucket <- c('under $10', 'inexpensive')
  moderateBucket <- c('$11-30', 'a$16-35', 'moderate')
  expensiveBucket <- c('$31-60')
  price_range <- tolower(price_range)
  ## regrettably (bc we need vectorized operations) there's no great way around the ifelse() chain
  out <- ifelse(price_range %in% cheapBucket, 'cheap',
         ifelse(price_range %in% moderateBucket, 'moderate',
         ifelse(price_range %in% expensiveBucket, 'expensive', NA)))
  return(out)
}
yelp <- yelp %>%
  mutate(price_range_messy = price_range) %>%
  select(-price_range) %>%
  mutate(price_range = cleanPriceRange(price_range_messy))
print(summary(factor(yelp$price_range)))
##     cheap expensive  moderate      NA's
##      9954       159     22012       806
yelp[500:550,]
yelp <- yelp %>%
select(-price_range_messy)
Alright, that feature should be mostly good to go. Might want to dummy code it later, but for now it’s fine.
Time to deal with the nuanced info vars. The mission will be to try to convert them to wide format… but we’ll see how it goes.
I’ll need to get a sense of the degree of variability in the moreinfovar variable.
summary(factor(yelp$moreinfovar))
## Accepted Cards Accepts Apple Pay Accepts Credit Cards
## 12 1950 2500
## Accepts Cryptocurrency Accepts Google Pay Alcohol
## 699 1137 877
## Ambience Attire Best Nights
## 724 578 328
## Bike Parking By Appointment Only Caters
## 2335 7 2201
## Coat Check Delivery Dogs Allowed
## 307 633 1113
## Drive-Thru Gender Neutral Restrooms Good For
## 44 696 532
## Good For Dancing Good for Groups Good For Happy Hour
## 383 888 1008
## Good for Kids Good for Working Happy Hour Specials
## 1080 11 42
## Has Dairy-free Options Has Gluten-free Options Has Halal Options
## 78 93 7
## Has Keto Options Has Kosher Options Has Pool Table
## 11 10 357
## Has Soy-free Options Has TV Legal ID
## 18 882 1
## Liked by Vegans Liked by Vegetarians Music
## 157 333 61
## Noise Level Offers Military Discount Open to All
## 1217 182 154
## Outdoor Seating Parking Smoking
## 904 2117 297
## Take-out Takes Reservations Waiter Service
## 2472 850 407
## Wheelchair Accessible Wi-Fi
## 1352 886
Hmm, yeah, this is tough. Some of these (e.g., Has Soy-free Options & Has Keto Options) are represented only a handful of times. Trying to convert these to wide will result in NAs pretty much across the board. I do want to avoid pruning these, as there is much more data yet to be collected. I might just convert to wide, then implement an NA quality threshold before moving on to analysis. The basic idea will be to drop vars with too many NAs, and tackle imputing values for those that seem worth saving.
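To make the wide conversion concrete before running it on the real data, here's what spread() does to a toy long-form slice (the names and values here are made up); note that an NA appears as soon as a brewery lacks one of the features:

```r
library(tidyr)

## toy long-form data: one row per brewery/feature pair
toy <- data.frame(name        = c('A', 'A', 'B'),
                  moreinfovar = c('Has TV', 'Wi-Fi', 'Has TV'),
                  moreinfoval = c('Yes', 'Free', 'No'),
                  stringsAsFactors = FALSE)
## spread to wide: one row per brewery, one column per feature
wide <- spread(toy, moreinfovar, moreinfoval)
wide
```

Brewery B never reported Wi-Fi, so its Wi-Fi cell comes out NA; on the real data, rarely-reported features turn into mostly-NA columns in exactly this way.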
Here we go:
## Some breweries have redundant entries for moreinfovar (80 total), need to drop these for spread to work
badBrews <- yelp %>%
group_by(name) %>%
summarize(dups = sum(duplicated(moreinfovar))) %>%
filter(dups >= 1) %>%
select(name)
yelp <- yelp[!(yelp$name %in% badBrews$name),]
yelp <- yelp %>%
spread(moreinfovar, moreinfoval)
colnames(yelp) <- tolower(colnames(yelp))
yelp
This seems surprisingly alright. Luckily, most of these features seem to be coded as Yes / No. I’ll have to think about how to handle the ones that aren’t (e.g., Parking).
The following function computes the proportion of missing data (from most to least) for all variables:
I realize I could’ve done this with sapply(), but this formats it nicely.
summarizeNAs <- function(data) {
  ## Returns a sorted df where column one is the name of the original column
  ## and column two is the proportion of data missing from that column
  holder <- data.frame(colName = character(), propMissing = numeric(), stringsAsFactors = FALSE)
  for (colIndex in 1:ncol(data)) {
    holder[colIndex, 1] <- colnames(data)[colIndex]
    holder[colIndex, 2] <- mean(is.na(data[[colIndex]]))
  }
  return(holder[order(holder$propMissing, decreasing = TRUE),])
}
naSummary <- summarizeNAs(yelp)
summarizeNAs(yelp)
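For reference, the sapply() version mentioned above really is a one-liner; a sketch on a toy frame rather than the yelp data:

```r
## toy frame with known NA proportions: a is 1/3 missing, b is 2/3 missing
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 'x'), stringsAsFactors = FALSE)
## proportion of NAs per column, sorted most-missing first
sort(sapply(toy, function(col) mean(is.na(col))), decreasing = TRUE)
##         b         a 
## 0.6666667 0.3333333
```

It returns a named vector rather than a two-column data frame, which is the formatting trade-off noted above.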
Unfortunately, as of now, only a few of these special info variables (5) have < 20% of values missing. For now, I’m going to drop all vars with > 50% missing values.
yelp <- yelp[, !(colnames(yelp) %in% naSummary[naSummary$propMissing > .5, 1])]
This leaves the following vars to work with:
colnames(yelp)
## [1] "name" "fri_close_time"
## [3] "fri_open_time" "fri_open_totalhours"
## [5] "mon_close_time" "mon_open_time"
## [7] "mon_open_totalhours" "sat_close_time"
## [9] "sat_open_time" "sat_open_totalhours"
## [11] "sun_close_time" "sun_open_time"
## [13] "sun_open_totalhours" "thu_close_time"
## [15] "thu_open_time" "thu_open_totalhours"
## [17] "tue_close_time" "tue_open_time"
## [19] "tue_open_totalhours" "wed_close_time"
## [21] "wed_open_time" "wed_open_totalhours"
## [23] "x1_star_count" "x2_star_count"
## [25] "x3_star_count" "x4_star_count"
## [27] "x5_star_count" "fri"
## [29] "mon" "sat"
## [31] "sun" "thu"
## [33] "tue" "wed"
## [35] "category" "city"
## [37] "claimed_status" "health_rating"
## [39] "ratings" "reviews"
## [41] "street" "price_range"
## [43] "accepts apple pay" "accepts credit cards"
## [45] "bike parking" "caters"
## [47] "parking" "take-out"
## [49] "wheelchair accessible"
Writing a quick mode-imputation function:
getmode <- function(v) {
v <- na.omit(v)
uniqv <- unique(v)
## tabulate takes the integer-valued vector bin and counts the number of times each integer occurs in it.
uniqv[which.max(tabulate(match(v, uniqv)))]
}
modeImputation <- function(data){
  ## Replace the NAs in each column with that column's mode
  for (column in colnames(data)) {
    modeValue <- getmode(data[[column]])
    data[is.na(data[[column]]), column] <- modeValue
  }
  return(data)
}
yelp <- modeImputation(yelp)
summarizeNAs(yelp)
Very good.
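As an aside, the same most-frequent-value idea behind getmode() can be expressed with table(), which drops NAs by default; a quick sketch:

```r
v <- c('yes', 'no', 'yes', NA, 'yes')
## tabulate the non-NA values and take the name of the biggest count
names(which.max(table(v)))
## [1] "yes"
```

The tabulate()/match() version above avoids the string conversion that table() does, but both land on the same answer here.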
Quickly dealing with the parking variable:
yelp <- yelp %>%
## If there's more than one parking option just label it 'multiple'
mutate(parking = ifelse(grepl(',', parking), 'multiple', parking))
Let’s convert some of these to dummy variables:
colnames(yelp) <- gsub('-', '_', colnames(yelp))
colnames(yelp) <- tolower(gsub(' ', '_', colnames(yelp)))
dummyCols <- yelp %>%
## I'm going to leave the category var alone for now
select(price_range:wheelchair_accessible)
yelp <- dummy_cols(yelp, select_columns = colnames(dummyCols), remove_first_dummy = TRUE)
colnames(yelp) <- tolower(colnames(yelp))
## Converting so dummy vars reflect yes value rather than no
yelp <- yelp %>%
mutate(accepts_credit_cards_yes = ifelse(accepts_credit_cards_no == 0, 1, 0),
bike_parking_yes = ifelse(bike_parking_no == 1, 0, 1),
wheelchair_accessible_yes = ifelse(wheelchair_accessible_no == 1, 0, 1),
accepts_apple_pay_yes = ifelse(accepts_apple_pay_no == 1, 0, 1)) %>%
select(-bike_parking_no, -accepts_credit_cards_no, -claimed_status, -accepts_apple_pay_no, -wheelchair_accessible_no, -(fri:wed))
yelp <- yelp %>%
mutate(reviews = as.numeric(str_extract(reviews, '\\d+')))
## Had to resort to a for loop to calculate weighted ratings
## (inside mutate(), sum() collapses the whole column rather than summing per row)
store <- numeric()
for (row in 1:(nrow(yelp))) {
store[row] <- with(yelp[row,], (x1_star_count * 1 + x2_star_count * 2 + x3_star_count * 3 + x4_star_count * 4 + x5_star_count * 5) / sum(c(x1_star_count,x2_star_count,x3_star_count,x4_star_count,x5_star_count)))
}
yelp$my_ratings <- store
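In hindsight, the loop can also be replaced with a vectorized matrix product: multiply the star-count matrix by the weights 1:5 and divide by each row's total count. A sketch on hypothetical counts for two breweries:

```r
## columns are the 1..5 star counts, one row per brewery (made-up numbers)
counts <- matrix(c(0, 1, 2, 3, 4,
                   5, 0, 0, 0, 5), nrow = 2, byrow = TRUE)
## weighted mean rating per row: (counts %*% weights) / total reviews
as.vector(counts %*% 1:5) / rowSums(counts)
## [1] 4 3
```

On the real data the same idea would apply to the x1_star_count through x5_star_count columns gathered into a matrix with as.matrix().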
yelp %>%
gather(rating_type, rating, ratings, my_ratings) %>%
mutate(rating_type = fct_recode(rating_type, 'Yelp Ratings' = 'ratings', 'My Converted Ratings' = 'my_ratings')) %>%
ggplot(aes(x = rating)) + geom_density(fill = 'blue', alpha = .8) + xlab('Ratings') + facet_wrap(~rating_type) +
theme_bw() +
theme(strip.background = element_rect(fill = 'white', color = 'black'))
Calculating the weighted scores by hand allows for more variability.
I think that pretty much wraps it up.
yelp
breweries <- read.csv('../../../data/breweries.csv')
census <- read.csv('../../../data/censusData.csv')
colnames(census) <- tolower(colnames(census))
census <- census[!(census$name %in% c('United States', 'Puerto Rico Commonwealth', 'District of Columbia')),]
census <- census %>%
select(name, popestimate2017) %>%
rename(state = name, population = popestimate2017)
yelp <- breweries %>%
select(name, state) %>%
inner_join(yelp, by = 'name')
## Warning: Column `name` joining factor and character vector, coercing into
## character vector
breweries <- breweries %>%
inner_join(census) %>%
select(name, state, population) %>%
group_by(state) %>%
summarize(population = max(population), nBreweries = n()) %>%
mutate(breweriesToPpl = nBreweries / population)
## Joining, by = "state"
## Warning: Column `state` joining factors with different levels, coercing to
## character vector
yelp <- yelp %>%
inner_join(breweries, by = 'state')
## Warning: Column `state` joining factor and character vector, coercing into
## character vector
write.csv(yelp, '../data/cleanYelp.csv', row.names = FALSE)